This course is an open data science course using Git and R. My GitHub repository is at https://github.com/KatariinaParnanen/IODS-project
This week we learned about data wrangling in R with the dplyr package. We were also introduced to basic regression analysis and model validation using diagnostic plots.
We will first read in the data from a local directory. The data is based on a questionnaire given to students of a course, together with their exam points.
#Read in table
lrn2014<-read.table("/Users/kparnane/Documents/IODS_course/IODS-project/data/learning2014.txt", header=TRUE, sep="\t")
We will explore the dataset’s dimensions and structure.
#Explore the dimensions
dim(lrn2014)
## [1] 166 7
#Explore the structure of the file
str(lrn2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
The data has 166 observations and 7 variables.
The dataset has a response variable “points” and explanatory variables “gender”, “age”, “attitude”, “deep”, “surf” and “stra”. “deep” is the score for deep-learning questions, “surf” the score for surface-learning questions, and “stra” the score for strategic learning techniques. “points” is the exam score. More info on the dataset can be found here.
We will explore the correlations and distributions of the explanatory and response variables using “ggpairs”, which plots pairwise correlations and variable distributions.
#Load required packages
library(GGally)
library(ggplot2)
#Draw plot
ggpairs(lrn2014, mapping = aes(col = gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
Pairwise correlation plot: Exploration of variables and correlations
#Get summaries of variables
summary(lrn2014)
## gender age attitude points deep
## F:110 Min. :17.00 Min. :14.00 Min. : 7.00 Min. :1.583
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00 1st Qu.:3.333
## Median :22.00 Median :32.00 Median :23.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :22.72 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :33.00 Max. :4.917
## surf stra
## Min. :1.583 Min. :1.250
## 1st Qu.:2.417 1st Qu.:2.625
## Median :2.833 Median :3.188
## Mean :2.787 Mean :3.121
## 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.333 Max. :5.000
There are more females than males. Variables other than age seem to roughly follow a normal distribution. The mean age is 25.5. We are interested in the response variable points and how the explanatory variables correlate with it. The highest positive correlations with points are for “attitude” and “stra”, and there is a negative correlation with “surf”.
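As a quick numeric check of what ggpairs shows, the correlations with the response can be computed directly with cor(). A minimal sketch using the built-in mtcars data as a stand-in, since the course data lives in a local file; with lrn2014 you would use its numeric columns and “points” in place of mtcars and “mpg”.

```r
# Rank explanatory variables by their correlation with the response.
# mtcars/mpg stand in for lrn2014/points here.
cors <- cor(mtcars)[, "mpg"]                               # correlation of every column with mpg
cors <- sort(cors[names(cors) != "mpg"], decreasing = TRUE)
round(cors, 2)
```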
Next we use the three variables identified in the data exploration as possibly correlated with exam points in a linear model.
# fit a linear model
my_model <- lm(points ~ attitude+stra+surf, data = lrn2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The variable “attitude” is positively associated with points (p-value 1.93e-08, estimate 0.33952). The other variables are not significantly associated with exam points, so they can be excluded from the model.
Drop the non-significant variables and fit the model again.
# fit a linear model
my_model2 <- lm(points ~ attitude, data = lrn2014)
#Plot the regression line in scatter plot with exam points versus attitude
qplot(attitude, points, data = lrn2014) + geom_smooth(method = "lm")
Scatter plot with regression line: Relationship of points and attitude
Print summary of the model.
# print out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
There is a significant relationship between attitude and exam points. The attitude variable has a p-value of 1.95e-09, an estimate of 0.35255 and a standard error of 0.05674. The multiple R-squared is 0.1906, meaning that the attitude variable explains roughly 19% of the variation in exam points.
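The R-squared values can also be pulled programmatically from the summary object rather than read off the printout. A sketch using the built-in mtcars data as a stand-in for the course data:

```r
# Extract R-squared values from a fitted linear model
fit <- lm(mpg ~ wt, data = mtcars)      # mtcars stands in for lrn2014 here
summary(fit)$r.squared                  # proportion of variance explained
summary(fit)$adj.r.squared              # penalized for the number of predictors
```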
We will produce diagnostic plots to check whether the model meets the assumptions of linear regression: the size of the errors does not depend on the fitted values, the errors are normally distributed, and the explanatory variables are not correlated with each other. The observations should also be independent of each other.
# Draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model2, which=c(1, 2, 5))
Diagnostic plots: Plots for exploring model errors
There don’t seem to be any patterns in the residuals vs fitted values plot, so the size of the residuals does not depend on the fitted values. The residuals in the Q-Q plot follow the reference line quite well, so the assumption of normally distributed errors is met. The residuals vs leverage plot doesn’t reveal any observations with an unusually high influence on the model. The model has only one explanatory variable, so correlation between explanatory variables is not a problem, and the observations are independent of each other. Thus, we can conclude that the model meets the assumptions.
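If a numeric check is wanted alongside the Q-Q plot, normality of the residuals can also be tested with shapiro.test() from base R. A sketch on mtcars as a stand-in dataset:

```r
# Shapiro-Wilk test on the residuals: a large p-value means no evidence
# against the normality assumption (a complement to, not a replacement for,
# the Q-Q plot)
fit <- lm(mpg ~ wt, data = mtcars)   # mtcars stands in for lrn2014
shapiro.test(residuals(fit))
```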
This week we have learned about logistic regression, piping and model validation.
Read in data and explore.
# Get access to useful libraries
library(dplyr);library(ggplot2);library(tidyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Read in table from local folder
alc<-read.table(file = "data/alc.txt", sep="\t", header=TRUE)
#Explore the data using glimpse
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob <fct> teacher, other, other, services, other, other, othe...
## $ reason <fct> course, course, other, home, home, reputation, home...
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
# Draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
The data has 382 observations and 35 variables from students whose alcohol consumption has been studied. The variables are integers, numerics, factors or logicals. More information about the data can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance.
The data includes variables for sex, absences (absences from school), goout (going out) and age. There are usually differences in how much boys and girls drink due to social norms; absences from school probably correlate with high alcohol use; going out a lot usually means higher alcohol consumption through more exposure to situations with alcohol; and age might have an effect because older kids have easier access to alcohol.
Logistic regression is a type of generalized linear model where the response variable can take only two values, for example true or false. In this dataset the response variable we are interested in is high alcohol use, which can be true or false.
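The link between probabilities, odds and log-odds can be sketched in base R with plogis() and qlogis(), the inverse-logit and logit functions; the value 0.5 below is just an illustrative log-odds, not a value from the data:

```r
# A logistic model works on the log-odds (logit) scale.
log_odds <- 0.5                 # made-up coefficient value for illustration
plogis(log_odds)                # inverse logit: probability ~0.62
exp(log_odds)                   # corresponding odds ratio ~1.65
qlogis(plogis(log_odds))        # logit inverts plogis, giving 0.5 back
```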
Explore how sex is related to high alcohol use. Our hypothesis is that males drink more than females.
#Explore the distributions of females and males in the data
g0<-ggplot(data = alc, aes(x=sex))
g0 + geom_bar()
#Explore how high use is dependent on sex using barplots
g1<-ggplot(data = alc, aes(x = high_use, fill = sex))
g1 + geom_bar()+facet_wrap("sex")
There are about equal numbers of males and females in the data. More males than females seem to be high users, which fits the social-norms hypothesis.
Explore how absences are related to alcohol use using boxplots. Our hypothesis is that kids who are absent more are more likely to also drink more.
# Explore the distribution of absences
g6<-ggplot(alc, aes(x=absences, fill =sex))
g6+geom_bar() + ylab("count") + ggtitle("Student absences by sex")
#Draw boxplots with high and low alcohol uses relation with absences
g3 <- ggplot(alc, aes(x=high_use, y=absences, col = sex))
g3 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")
Absences are not normally distributed. There is a long tail and the most common value is 0. There seem to be more absences among the kids with high alcohol use. This makes sense if you think that kids who drink more also skip more school. The patterns seem similar between females and males.
Explore how going out is related to alcohol use. Our hypothesis is that kids who are going out more are exposed to alcohol more and thus will also drink more.
#Explore the distibution of kids going out
g4 <- ggplot(alc, aes(x=goout, fill=sex))
g4 + geom_bar()
#Explore the relationship with high alcohol use
g5 <- ggplot(alc, aes(x=high_use, y=goout, col = sex))
g5 + geom_boxplot() + ggtitle("Going out by alcohol consumption and sex")
Going out seems fairly normally distributed between the values 1 and 5. However, there aren’t many kids who hardly go out at all compared to the ones who go out very often. There are approximately equal numbers of both sexes in each going-out class.
There definitely seems to be a relationship between going out and alcohol use. The link may be weaker for females, since among females the ones who go out often but don’t have high alcohol use overlap with those who go out often and do.
Explore the relationship with age and alcohol use. Our hypothesis is that it is easier for older kids to get alcohol than for younger kids.
# Explore the distribution of age
g6<-ggplot(alc, aes(x=age, fill =sex))
g6+geom_bar()
# Explore the relationship of age with alcohol use
g7 <- ggplot(alc, aes(y=age, x=high_use, col = sex))
g7 + geom_boxplot() + ylab("age") + ggtitle("Student age by alcohol consumption and sex")
Age is not normally distributed and there is quite a long tail above 19. Ages range from 15 to 22 with a median of 16. From age 18 up, there are approximately equal numbers of females and males at each age.
There is no obvious link between alcohol use and age in the boxplots, since the high-use boxplots overlap with the low-use boxplots. It might be that it is not very difficult to get alcohol in the students’ country even below the age of 18.
Next we will build a GLM using binomial family with the interesting explanatory variables.
# Use generalized linear models to predict high use based on selected variables
m<-glm(high_use~age+sex+absences+goout, data=alc, family = "binomial")
#Print out summary and coefficients
summary(m)
##
## Call:
## glm(formula = high_use ~ age + sex + absences + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7877 -0.8026 -0.5393 0.8294 2.5201
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.60333 1.84395 -3.039 0.002376 **
## age 0.08951 0.11003 0.814 0.415905
## sexM 0.96338 0.25490 3.779 0.000157 ***
## absences 0.08205 0.02247 3.651 0.000261 ***
## goout 0.71753 0.12058 5.951 2.67e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 387.09 on 377 degrees of freedom
## AIC: 397.09
##
## Number of Fisher Scoring iterations: 4
#From the summary we can see that sex, absences and going out are significantly correlated with high alcohol use
coef(m)
## (Intercept) age sexM absences goout
## -5.60333157 0.08951186 0.96338497 0.08204599 0.71752626
# From the coefficients we can calculate the odds ratios by taking the exponential of the estimates
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI<- confint(m) %>% exp
## Waiting for profiling to be done...
# Join the two together
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.003685565 9.268741e-05 0.1296785
## age 1.093640305 8.815211e-01 1.3582243
## sexM 2.620551969 1.599662e+00 4.3546202
## absences 1.085505735 1.040056e+00 1.1371493
## goout 2.049357367 1.626970e+00 2.6130756
From the odds ratios we can see that each additional year of age is associated with approximately 9% higher odds of high alcohol use, but the confidence interval includes 1, which means that age is not significant.
We can also see that being male is associated with approximately 2.6 times higher odds of high alcohol use. Female is the reference level. The CI does not include 1, so sex is significant.
Each additional absence from school is associated with approximately 9% higher odds of high use, and this is also significant as the CI doesn’t include 1.
Going out, on a scale from 1 to 5, is significantly associated with high alcohol use as the CI doesn’t include 1. An increase of 1 is associated with approximately 2 times higher odds of high use.
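The percent-change reading of an odds ratio comes straight from exponentiating the coefficient; a small sketch using the absences estimate from the model summary above:

```r
# Odds ratio and percent change in odds for one extra absence
b_absences <- 0.08205            # estimate from the model summary above
exp(b_absences)                  # odds ratio, ~1.086
(exp(b_absences) - 1) * 100      # ~8.6% higher odds per additional absence
```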
After obtaining the model, we will use cross-validation to estimate the error rate of the model. We will use the model without age, since age was not significant.
# Fit the model without age
m<-glm(high_use~sex+absences+goout, data=alc, family = "binomial")
# Predict the probability of high alcohol use
probabilities <- predict(m, type = "response")
# Add the predicted probabilities to the alc table
alc <- mutate(alc, probability = probabilities)
# Use the probabilities to make a TRUE/FALSE prediction of high alcohol use
alc <- mutate(alc, prediction = probabilities>0.5)
# See the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
## failures absences sex high_use probability prediction
## 373 1 0 M FALSE 0.14869987 FALSE
## 374 1 7 M TRUE 0.39514446 FALSE
## 375 0 1 F FALSE 0.13129452 FALSE
## 376 0 6 F FALSE 0.18714923 FALSE
## 377 1 2 F FALSE 0.07342805 FALSE
## 378 0 2 F FALSE 0.25434555 FALSE
## 379 2 2 F FALSE 0.07342805 FALSE
## 380 0 3 F FALSE 0.03989428 FALSE
## 381 0 4 M TRUE 0.68596604 TRUE
## 382 0 2 M TRUE 0.09060457 FALSE
# From here you can see that when prob is >0.5 the prediction is "TRUE"
# Cross tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.83246073 0.16753927 1.00000000
# From here we can see that 65 of 318 predictions are wrong when low alcohol use is predicted, and 15 of 64 are wrong when high alcohol use is predicted.
# Define a loss function to get the mean prediction error
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# Call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2094241
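To see what the loss function computes, here is a self-contained toy check: a prediction counts as wrong when the predicted probability falls on the wrong side of 0.5, and the function returns the proportion of such cases.

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5   # TRUE where the prediction is wrong
  mean(n_wrong)                        # proportion of wrong predictions
}
truth <- c(TRUE, FALSE, TRUE, FALSE)   # toy true classes (coerced to 1/0)
prob  <- c(0.9,  0.2,   0.4,  0.7)     # the last two are on the wrong side
loss_func(class = truth, prob = prob)  # 0.5
```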
# Do K-fold cross-validation using the loss_func defined previously
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2120419
There is approximately 21% prediction error for high alcohol use using the model with sex, absences and going out.
This week we have learned about clustering, which is a handy tool for visualizing differences between data points. Clustering can be used for example for data exploration. We also learned about classification, which can be used to validate the results of clustering. Discriminant analysis was also touched upon.
We will load the dataset “Boston” from the MASS package. The dataset includes data on per-capita crime rates for towns in the Boston area, with explanatory variables related to land use and inhabitant demographics.
# Access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(dplyr)
# Load the data
data("Boston")
# Explore the dataset's dimensions and details of the variables
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# Load the corrplot package for visualizing correlation matrices
library(corrplot)
## corrplot 0.84 loaded
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston)%>%round(digits=2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)
The crime rate has its highest positive correlations with rad (index of accessibility to radial highways), tax (full-value property-tax rate per $10,000), lstat (percentage of lower-status population) and indus (proportion of non-retail business acres per town), and its strongest negative correlations with medv (median value of owner-occupied homes in $1000s), black (1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town) and dis (weighted mean of distances to five Boston employment centres). Some of the variables have only a few high or low values, which can produce strong-looking correlations that may not be significant.
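The same ranking can be pulled numerically from the correlation matrix rather than read off the corrplot; a short sketch using the Boston data from MASS:

```r
library(MASS)                                 # provides the Boston dataset
crim_cor <- cor(Boston)[, "crim"]             # correlations with the crime rate
round(sort(crim_cor[names(crim_cor) != "crim"], decreasing = TRUE), 2)
# rad (0.63) and tax (0.58) top the list
```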
Next we will scale the variables using the scale function.
# Center and standardize variables
boston_scaled <- scale(Boston)
# Summaries of the original, unscaled variables for comparison
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# Class of the boston_scaled object is matrix
class(boston_scaled)
## [1] "matrix"
# Change the matrix to a data frame
boston_scaled<-as.data.frame(boston_scaled)
# Summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# Create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# Create a categorical variable 'crime' using the quantiles as break points
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# Look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# Remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# Add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
The values were centered by subtracting the column means and then scaled by dividing the centered columns by their standard deviations.
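Scaling with scale() is the same as doing the centering and dividing by hand; a minimal sketch:

```r
# scale() subtracts the column mean and divides by the column standard deviation
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)   # TRUE
```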
Next we will create training and test sets from the data. We will use 80% of the data for training and then use the rest for testing our model.
# Number of rows in the Boston dataset
n <- nrow(boston_scaled)
# Choose 80% of the rows randomly
ind <- sample(n, size = n * 0.8)
# Create train set
train <- boston_scaled[ind,]
Next we will fit linear discriminant analysis to the data using the lda() function with all the explanatory variables.
# Linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2599010 0.2524752 0.2400990
##
## Group means:
## zn indus chas nox rm
## low 0.91934427 -0.9176465 -0.15421606 -0.8550746 0.4900473
## med_low -0.09767078 -0.2730274 -0.04735191 -0.5540392 -0.1585990
## med_high -0.39307863 0.1795891 0.22945822 0.4053551 0.1071106
## high -0.48724019 1.0172187 -0.06938576 1.0285090 -0.4238136
## age dis rad tax ptratio black
## low -0.8407054 0.8151475 -0.6936060 -0.7262326 -0.3970232 0.3896097
## med_low -0.3389343 0.3168673 -0.5509226 -0.4704292 -0.0494064 0.3162580
## med_high 0.4402226 -0.3734726 -0.4076377 -0.3047673 -0.2996246 0.1063285
## high 0.7927422 -0.8542498 1.6371072 1.5133254 0.7795879 -0.8579791
## lstat medv
## low -0.77633602 0.55889263
## med_low -0.12652733 -0.02520945
## med_high 0.04402535 0.16784243
## high 0.84339141 -0.68004621
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.07448495 0.66504070 -0.85566195
## indus 0.05275595 -0.22841003 0.43845619
## chas -0.08676219 -0.10296902 0.08794376
## nox 0.36365407 -0.78109337 -1.37268666
## rm -0.14817813 -0.09301656 -0.24698636
## age 0.26769510 -0.32788474 -0.17396315
## dis -0.01794409 -0.30329234 0.04827372
## rad 3.22414317 0.82169610 0.04397460
## tax -0.04574639 0.08478246 0.34117729
## ptratio 0.09153638 0.02385299 -0.31646042
## black -0.14964951 0.01634448 0.12435279
## lstat 0.17341707 -0.23414633 0.24093020
## medv 0.20269644 -0.35522444 -0.28033194
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9482 0.0379 0.0139
From the proportion of trace we see that LD1 explains about 95% of the between-group variance, so the first discriminant captures most of the separation between the crime classes. In the group means we can also see the means of the explanatory variables for the different crime classes; for example, rad and tax differ considerably between the low and high crime-rate classes.
Next we will draw a plot and see how the different towns group in clustering. We will color and annotate the dots with crime rates and draw arrows for directions and weights of the different explanatory variables.
# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# Target classes as numeric
classes <- as.numeric(train$crime)
# Plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)
From the plot it seems that rad (accessibility to radial highways) has the biggest impact on the grouping. There are two clusters in the plot, one of which contains almost all the high crime-rate towns and some medium-high ones, but none of the medium-low or low crime-rate towns. It seems the model might only need rad to predict the crime rates.
Next we will validate the results using the test set we made earlier. We will predict crime rate classes with test data using the lda.fit model we created with the training set.
# Create test set
test <- boston_scaled[-ind,]
# Save the correct classes from test data
correct_classes <- test$crime
# Remove the crime variable from test data
test <- dplyr::select(test, -crime)
## Predict classes with test data using the lda.fit model we created with the training set.
lda.pred <- predict(lda.fit, newdata = test)
# Cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 17 10 0 0
## med_low 4 14 3 0
## med_high 1 5 17 1
## high 0 0 0 30
From the cross tabulation we can see that all 30 high crime-rate towns were predicted correctly, and only one med_high case was incorrectly classified as high. This is pretty good! For the low class, 17 of 27 were predicted correctly, while 10 were predicted as med_low.
The model struggles more with the middle classes. Of the med_high cases, 17 of 24 were correctly classified, with 1 predicted as low, 5 as med_low and 1 as high.
The model does similarly with the med_low class: 14 of 21 were correctly classified, with 4 predicted as low and 3 as med_high.
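The overall accuracy can be computed from the cross tabulation; the counts below are copied from the table above (the train/test split is random, so a re-run would give slightly different numbers):

```r
# Confusion matrix: rows = correct class, columns = predicted class
conf <- matrix(c(17, 10,  0,  0,
                  4, 14,  3,  0,
                  1,  5, 17,  1,
                  0,  0,  0, 30), nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)   # proportion of correctly classified towns, ~0.76
```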
Next we will reload the Boston dataset, scale it and use the k-means algorithm to find the optimal number of clusters in the dataset.
# Obtain dataset
data('Boston')
# Scale the data and change the resulting matrix to a data frame
boston_scaled2 <- as.data.frame(scale(Boston))
# K-means clustering
km <-kmeans(boston_scaled2, centers = 3)
# Plot the Boston dataset with clusters, using all 14 variables
pairs(boston_scaled2[1:14], col = km$cluster)
The variables rad and tax seem to split the data quite nicely into two clusters in the crim vs rad and crim vs tax plots. Three clusters might not be needed; two might suffice. Let’s check this next.
We will determine the number of clusters by looking at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. A good number of clusters is where the total WCSS drops sharply, which can be observed in the plot.
# MASS, ggplot2 and Boston dataset are available
set.seed(123)
# Determine the maximum number of clusters
k_max <- 10
# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# Visualize the results with qplot
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <-kmeans(boston_scaled2, centers = 2)
# Plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
As we observed in the previous pairs plot, the WCSS qplot suggests that the optimal number of clusters is 2. Rad and tax produce the best separation between the clusters.